I want to share a take on grammar‑constrained generation I’ve been working on for a while.
This is a preliminary writeup: enough detail to communicate the core idea—use GLR/GSS for commit, and compute masks by running a precompiled weighted automaton over the stack—and timestamp it. I’ll likely follow up with a post on the less glamorous part: all the tricks you need to compile these automata without exploding time/memory.
You have a grammar (often a JSON Schema, or a language-ish CFG) and you want the LLM to produce grammatically valid output.
At each decoding step, the LLM outputs logits over the vocabulary (V). You sample a token, append it to the output, and repeat.
Constrained decoding is: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state and, for each step, compute a mask (M \subseteq V) of allowed tokens, and apply it to the logits.
No more broken JSON.
Think of constrained decoding as two functions:
- get_mask(state) returns a mask (M \subseteq V) over the LLM vocabulary.
- commit(state, token) updates the constraint state with the chosen token.

At runtime the loop looks like:
loop:
parallel:
GPU: logits
CPU: mask
apply mask
sample token
commit token to model and constraint
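As a toy sketch of that loop body, assuming a greedy sampler and treating get_mask/commit as injected callbacks (the stub bodies below are invented for illustration):

```python
import math

def decode_step(logits, state, get_mask, commit):
    """One constrained decoding step: mask invalid tokens, pick one, commit."""
    allowed = get_mask(state)                      # mask M ⊆ V as a set of ids
    masked = [x if v in allowed else -math.inf     # apply mask to logits
              for v, x in enumerate(logits)]
    tok = max(range(len(masked)), key=masked.__getitem__)  # greedy "sample"
    commit(state, tok)                             # advance the constraint state
    return tok

# Toy constraint that only ever allows token ids 0 and 2:
state = []
tok = decode_step([0.1, 5.0, 0.3, 2.0], state,
                  get_mask=lambda s: {0, 2},
                  commit=lambda s, t: s.append(t))
print(tok)  # 2: the highest-logit *allowed* token, despite token 1's higher logit
```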
LLM inference is batched: many user requests share the GPU. Mask generation sits directly on the critical per-token decoding path.
If you want ~1k tokens/sec, you have ~1ms to produce a mask. And if you miss even a single one—if a mask takes 2ms instead of 1ms—it screws up scheduling. The GPU waits on the straggler mask, and suddenly your p99 (or p99.9) latency is trashed.
Masks can’t be “fast on average.” They have to be fast every single step.
commit is incremental parsing (GLR + GSS)

The commit op is basically incremental parsing: you’re incrementally parsing a growing prefix.
Many battle‑tested incremental parsers (tree‑sitter, Lezer, etc.) use Generalized LR (GLR) parsing with a graph‑structured stack (GSS).

Glossary

- GLR: Generalized LR parsing; where the grammar is ambiguous or locally unclear, the parser pursues several parse hypotheses at once.
- GSS: graph‑structured stack; a DAG that lets those hypotheses share common stack suffixes instead of duplicating whole stacks.
- Terminal: a grammar‑level symbol emitted by the lexer, e.g. --, NUMBER, IDENT.

Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token (v \in V), check whether appending it keeps the parse valid.
But you don’t append an LLM token to the parser. You append terminal symbols.
And a single LLM token is a byte string which might:

- lex into multiple terminals (e.g. "}\n" could lex as RBRACE NEWLINE),
- cover only part of a terminal (e.g. the first bytes of true).

So “try token (v)” is really “try every way (v)’s bytes could lex into some terminal sequence(s), then try advancing the parser for those terminals.”
That’s where runtime goes to die.
A common direction is: walk a trie over the vocabulary’s byte strings, simulating lexer and parser transitions as you go and pruning subtrees whose prefixes are already invalid.

This can work well for many grammars and workloads. People implement this and get decent average performance, especially for “simple” grammars.
But for general CFG-ish constraints (especially when you care about worst‑case), two things bite:
Even if your lexing side is deterministic (say you enforce longest‑match, or you have a DFA lexer), a single grammar terminal shift can trigger a chain of reductions.
In an LR parser, “shift terminal (t)” is often really: perform zero or more reductions (pop a rule’s right-hand side, push the goto state), and only then shift (t).
Those reduction chains are table‑driven and fast in the happy path, but in GLR they can fan out. The operations are pointer‑heavy and cache‑unfriendly: you’re manipulating a GSS, merging nodes, pruning branches, etc.
If you do that inside masking—i.e. while traversing a trie of possible tokens—you’re effectively doing speculative GLR work for a huge number of hypothetical tokens on every step.
That’s exactly the kind of work you do not want on the critical path.
Even with tries, some tokens’ byte strings correspond to long sequences of grammar terminals. Sometimes those get quickly pruned. Sometimes they don’t.
Example: Python accepts monstrosities like:

y = ----------------x

(sixteen chained unary minuses). Those sixteen hyphens may be tokenized (BPE) as a single LLM token (IDs just illustrative):

"----------------" → [28542]

But that token lexes to 16 MINUS terminals:

MINUS MINUS MINUS … MINUS (×16)
If masking “tries” that token by actually feeding all 16 MINUS terminals through the parser, you’ve just forced 16 terminal‑level shifts (plus reductions) triggered by a single vocabulary entry. That’s pathological for worst‑case mask time.
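You can check the lexing side of this claim with Python’s own tokenizer (the BPE side depends on the particular vocabulary, so it isn’t shown):

```python
import io
import tokenize

# Sixteen unary minuses: valid Python, and the lexer emits sixteen '-' tokens.
src = "y = ----------------x\n"
minuses = [t for t in tokenize.generate_tokens(io.StringIO(src).readline)
           if t.string == "-"]
print(len(minuses))  # 16
```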
You can transform grammars, memoize, prune, etc. But in my experience, if you keep “simulate parse work per candidate token” inside get_mask, you end up fighting worst‑case behavior forever.
Token validity depends on what’s on the stack. Instead of asking:
“does token (v) work on this stack?”
for each (v \in V), ask:
“given this stack, which tokens (v) work?”
That reframing is the whole move. Now we want a data structure that implements “stack → allowed tokens” quickly and predictably.
That’s where the weighted automaton comes in.
Think of a finite automaton, except transitions carry “weights”, and traversing a path combines those weights.
Here:

- a “weight” is a set of LLM tokens (think: a bitmask over V),
- combining weights along a path means intersecting those sets,
- merging alternative paths means unioning them.

So as you run the automaton on the current parse configuration, each transition filters the set of still-possible tokens, and the tokens that survive to an accepting state form the mask.

This is a “vectorized membership test” idea: instead of asking “is token (v) allowed?” once per token, a single traversal answers the question for the whole vocabulary at once.
If you fix a terminal sequence (\tau = t_1 t_2 \dots t_k), the set of LR stack configurations from which (\tau) can be consumed without error can be characterized by a finite-state device over parser states (this is closely related to the classical “viable prefixes are regular” result for LR parsing).
The weighted automaton is the operational way to use that fact at runtime without iterating over tokens.
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> token_set
frontier = { A.start: ALL_TOKENS }
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # ∩ filter tokens
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # ∪ merge paths
frontier = new
# optional early-exit: if nothing in `frontier` can be affected by deeper stack
if done(frontier):
break
return tokens_reaching_accepting(frontier)
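Here’s a runnable toy instance of that loop, with a hand-built two-transition automaton and token sets as int bitmasks (the states, stack IDs, and weights are all invented for illustration):

```python
ALL, EMPTY = 0b1111, 0          # 4-token toy vocabulary as a bitmask
START, ACC = "s0", "acc"

# TRANS[(automaton_state, stack_state_id)] -> [(next_state, weight)]
TRANS = {
    (START, 7): [(ACC, 0b0011)],   # top-of-stack parser state 7 allows tokens {0,1}
    (START, 9): [(ACC, 0b0110)],   # top-of-stack parser state 9 allows tokens {1,2}
}

def get_mask(stack_top_down):
    frontier = {START: ALL}
    accepted = EMPTY
    for sid in stack_top_down:
        new = {}
        for a_state, toks in frontier.items():
            for a2, weight in TRANS.get((a_state, sid), []):
                t2 = toks & weight                    # ∩ filter along the path
                if t2:
                    new[a2] = new.get(a2, EMPTY) | t2 # ∪ merge paths
        accepted |= new.pop(ACC, EMPTY)   # tokens whose path already accepts
        frontier = new
        if not frontier:                  # early exit: deeper stack is irrelevant
            break
    return accepted

print(bin(get_mask([7, 42, 3])))  # 0b11: only tokens {0,1} survive
```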
Operationally, the inner loop is: read the next stack state ID, intersect each carried token set with the matching transition’s weight, drop empty sets, and union sets that land on the same automaton state.

Critically:

- no reductions are simulated,
- no parser structures are allocated or mutated,
- nothing iterates over the vocabulary.
The automaton is doing the “parser work” ahead of time; the runtime does only predictable set‑flow work.
For GLR, your parse configuration is a graph-structured stack (a DAG encoding many stacks sharing suffixes). You want:
a token is valid if it’s valid on any surviving parse path.
So you run essentially the same ∩/∪ propagation, but over a product of two structures: the GSS (which stack paths exist) and the automaton (which state each path has reached).
When GSS paths merge, you union token sets (since any path suffices). When automaton paths merge, you union too.
One practical implication: to make this fast, you don’t want to recompute “automaton-frontier maps” per path. You want a representation where GSS nodes can carry and merge accumulated values when nodes merge. (This is the part that becomes implementation-heavy; it’s doable, but the data structure matters.)
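Conceptually (setting aside the shared-value optimization just mentioned, and collapsing the automaton into per-edge weights for brevity), “valid on any path” is ∩ along each path and ∪ across paths. A toy sketch with invented weights:

```python
from functools import reduce

# Toy GSS with two paths sharing a suffix:
#   A -> C -> end   and   B -> C -> end.
# Edge weights are token bitmasks; ∩ along a path, ∪ across paths.
EDGE_WEIGHT = {("A", "C"): 0b0111, ("B", "C"): 0b1110, ("C", "end"): 0b0110}
PATHS = [["A", "C", "end"], ["B", "C", "end"]]

def path_mask(path):
    """Intersect edge weights along one stack path."""
    return reduce(lambda acc, e: acc & EDGE_WEIGHT[e],
                  zip(path, path[1:]), 0b1111)

# A token is allowed iff some surviving parse path allows it.
mask = reduce(lambda acc, p: acc | path_mask(p), PATHS, 0)
print(bin(mask))  # 0b110
```

A real implementation would not enumerate paths like this; as noted above, it carries and merges accumulated values at GSS nodes.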
At this point “the weighted automaton” may still sound like a black box. The important thing is: it’s not runtime magic; it’s a compiled artifact.
What we want to precompute is:
given the current lexer+parser configuration (represented by active lexer states + a stack/GSS), which LLM tokens could extend the parse without immediately failing?
There are two distinct sources of complexity:

- the lexer side: how a vocabulary token’s bytes map to grammar terminal sequences,
- the parser side: how a terminal sequence interacts with the parse stack.

The structure we compile is essentially a composition of two automata families:

- a Terminal DWA, relating a token’s bytes to the terminal sequences they could emit,
- per-terminal parser templates, describing the stack behavior needed to consume each terminal,

composed into a final deterministic weighted automaton (Parser DWA) that reads parser stack state IDs and outputs token sets directly.
An LLM token is a byte string. A lexer consumes a byte stream and emits grammar terminals, but in an incremental setting the two don’t line up: a token’s bytes can begin while the lexer is already partway through a terminal, and can end leaving it partway through another.
So the mapping “vocabulary token → terminals” is not a single lookup. What you really want is:
from a given lexer configuration (\ell), what terminal sequences could be emitted while consuming the bytes of a single vocabulary token (v)?
The Terminal DWA is a deterministic weighted automaton that compactly represents that relation: it reads a token’s bytes and carries, as weights, the terminal sequences those bytes could correspond to.

A practical way to build it: take the product of a trie over the vocabulary’s byte strings with the lexer automaton, recording emitted terminals as weights along the way, then determinize.
Important structural property: the Terminal DWA is acyclic.
This isn’t a deep theorem; it’s a consequence of the construction: you’re consuming a finite token from a finite trie. You can’t loop while consuming additional bytes of the same token. That acyclicity will matter later because it gives you a hard bound on the amount of “within-token” structure the compiled system can traverse.
Also note what I didn’t say: the Terminal DWA doesn’t need to tell you “the lexer state you end up in after the token.” For masking you only need to know whether there exists at least one valid lexing path for the token (and which terminal sequence(s) it could correspond to). The actual evolution of lexer ambiguity gets handled by commit.
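To make the token→terminals relation concrete, here’s a toy enumeration of it for a two-terminal lexer (MINUS = "-", ARROW = "->"); the real construction compiles this relation into the Terminal DWA rather than enumerating it at runtime, and also tracks provisional in-progress terminals, which this sketch ignores:

```python
def lex_seqs(s):
    """All ways the byte string s splits into complete MINUS/ARROW terminals.
    (Toy: ignores provisional trailing terminals and longest-match.)"""
    if s == "":
        return [()]
    out = []
    if s.startswith("->"):
        out += [("ARROW",) + rest for rest in lex_seqs(s[2:])]
    if s.startswith("-"):
        out += [("MINUS",) + rest for rest in lex_seqs(s[1:])]
    return out

print(lex_seqs("-->"))  # [('MINUS', 'ARROW')]
print(lex_seqs("--"))   # [('MINUS', 'MINUS')]
```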
Now suppose we have a grammar terminal (t). What does the parser do?
In LR parsing, “consume (t)” is generally: perform zero or more reductions (each pops a rule’s right-hand side off the stack and pushes a goto state), then shift (t) itself;
and in GLR there can be multiple viable reduction sequences.
The key observation is: you can precompute, for each terminal (t), a finite-state object that reads a stack suffix (parser state IDs from top downward) and describes whether (t) can be legally processed (including any reduction behavior relevant to enabling that shift).
This idea is closely related to the reduction-graph style optimizations in Aycock & Horspool, “Even Faster Generalized LR Parsing” (2001): rather than “discover” reduction behavior by repeatedly simulating it on a pointer-heavy GSS, you structure it so the work is table-/automaton-driven.
When you have to compose “consume (t_1), then (t_2), …” you need to reason about how stack actions compose.
A convenient algebra for this is the polycyclic monoid: for each stack symbol there is a “push” generator and a matching “pop” generator; popping the symbol you just pushed cancels to the identity, and popping any other symbol collapses to an absorbing zero element (0), meaning “impossible.”
At a high level: this lets you symbolically carry “what stack suffix is required” while composing templates, while aggressively simplifying via cancellation and pruning impossible compositions early via the (0) element.
This isn’t “cute algebra”—it’s the thing that makes the composition tractable, because it turns “does this sequence of terminals line up with some stack behavior?” into mechanical simplification + pruning.
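A minimal sketch of that mechanical simplification, representing a stack effect as a (pops, pushes) pair with an absorbing ZERO (this tuple representation is invented here for illustration; the actual compiler works over automata, not bare tuples):

```python
ZERO = None  # absorbing element: "this composition is impossible"

def compose(e1, e2):
    """Apply stack effect e1, then e2; cancel e1's pushes against e2's pops.
    An effect is (pops, pushes): symbols popped off the top, then pushed."""
    if e1 is ZERO or e2 is ZERO:
        return ZERO
    pops1, pushes1 = list(e1[0]), list(e1[1])
    pops2, pushes2 = list(e2[0]), list(e2[1])
    while pushes1 and pops2:
        if pushes1[-1] != pops2[0]:
            return ZERO        # mismatched symbol: prune this composition
        pushes1.pop()          # cancellation: pop(s) after push(s) = identity
        pops2.pop(0)
    return (pops1 + pops2, pushes1 + pushes2)

# push A, push B — then pop B, push C  ==>  net effect: push A, push C
print(compose(([], ["A", "B"]), (["B"], ["C"])))  # ([], ['A', 'C'])
# push A — then pop B  ==>  impossible
print(compose(([], ["A"]), (["B"], [])))          # None
```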
If you allow arbitrary grammars, reduction behavior can contain cycles (intuitively: you can keep reducing in ways that don’t make progress on the relevant lookahead), which would imply “in principle you might need to inspect unbounded stack depth.”
In practice, for “programming-language-ish” grammars and for JSON-schema-derived grammars, you can normalize away the patterns that create cyclic reduction behavior. Aycock & Horspool’s analysis implies (and in my experience, you can make explicit) that after certain transformations, the per-terminal reduction/viability templates can be made acyclic.
I’ll be vague here because the transformation details are the engineering-heavy part I want to write up separately, but the punchline is: after the right grammar transformations, each per-terminal template only ever needs to inspect a bounded suffix of the stack.
Now we compose the Terminal DWA with the per-terminal parser templates, to get a Parser DWA that:

- reads parser stack state IDs from the top downward,
- and directly outputs the set of vocabulary tokens that could be consumed next.
When a vocabulary token emits a terminal sequence (t_1 t_2 \dots t_k), the composed object is effectively chaining the stack-effects associated with each (t_i).
This is exactly where the polycyclic monoid representation earns its keep: the pushes produced by one terminal’s stack effect cancel against the pops required by the next, and any mismatch collapses to (0) and is pruned at compile time.
This pruning/cancellation is what lets you compile “a token that emits many terminals” into something that still only depends on a bounded stack suffix, rather than re-simulating a long reduction chain at runtime.
Two properties line up: the Terminal DWA is acyclic (by construction), and after the normalization above the per-terminal templates are acyclic too.
Under composition, you get a Parser DWA that is also acyclic.
And that has a very concrete runtime implication:
There exists a fixed depth (D) such that get_mask never needs to inspect more than the top (D) parser states (per active lexer configuration), regardless of how deep the real parse stack is.
That’s the kind of “worst‑case bounded work” statement that actually matters for p99.9.
During compilation, templates need enough symbolic machinery to compose terminal sequences correctly—i.e., to ensure “requirements produced earlier in the sequence” match “requirements consumed later,” with cancellation and mismatch pruning.
But at runtime, get_mask doesn’t need to produce the next parser stack. It only needs to answer:
does there exist a legal way to consume this token next?
The real parser/lexer (in commit) is responsible for actually performing shifts/reductions and maintaining the GSS.
So once compilation is finished, you can project away the parts of the compiled state that were only needed for symbolic stack-effect composition, and keep a machine that is “read-only over the current stack”: it consumes stack state IDs, filters token sets, and produces the mask.
You might ask: “do you really scan the whole stack every step?”
No.
With an acyclic Parser DWA, there’s a hard bound (D) on how many stack symbols any accepting path can consume. So get_mask can look at the top (D) stack entries (per GSS path) and stop.

You can still add an early-exit optimization on top: stop as soon as the frontier can no longer change the answer (the done(...) check in the pseudocode).
But the important point isn’t “masks stabilize quickly”; it’s that the compiled structure gives you a worst‑case bound on the amount of stack information masking can depend on.
The term “token” is overloaded:

- LLM tokens: vocabulary entries, i.e. arbitrary byte strings produced by BPE;
- grammar terminals: symbols emitted by the lexer (IDENT, "{", NUMBER, keywords, etc.).

These don’t align.
Even with longest‑match lexing, longest‑match is forward‑looking: you can’t know a match is final until you see what comes next.
Classic example: "+" followed by "+" should yield INCREMENT ("++"), not two PLUS.

In a streaming setting, after you see the first "+", you’re in a state where the input so far might already be a complete PLUS, or might be the first byte of INCREMENT.

Every incremental parser/lexer has to represent that uncertainty somehow. The standard move is: keep both hypotheses alive and prune later.
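A toy sketch of the branch-and-prune pattern for exactly this PLUS/INCREMENT case, with hypotheses as (emitted, pending) pairs (a real implementation would also apply longest-match pruning, which this sketch omits):

```python
def step(hyps, byte):
    """Advance every lexing hypothesis by one byte; drop dead ones."""
    new = set()
    for emitted, pending in hyps:
        s = pending + byte
        if s == "+":
            new.add((emitted + ("PLUS",), ""))  # treat '+' as a finished PLUS...
            new.add((emitted, "+"))             # ...or as the start of "++"
        elif s == "++":
            new.add((emitted + ("INCREMENT",), ""))
    return new

h = step({((), "")}, "+")        # two live hypotheses after one byte
h = step(h, "+")                 # the second '+' resolves one of them
print((("INCREMENT",), "") in h) # True
```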
Operationally, my implementation keeps a set of active lexer configurations, emits terminals provisionally, and prunes hypotheses as later bytes resolve the ambiguity.
I sometimes call the provisional ones “inhibited terminals” as shorthand, but the underlying idea is not novel: it’s the usual incremental-lexing branch-and-prune pattern. It composes nicely with GLR+GSS, because “fork and later prune” is already the parser’s native operation.
Practically, the constraint state you carry around is not just “parser stack(s)”; it’s “(parser stack(s), active lexer configurations)”. That’s why get_mask is conditioned on both, and why the Parser DWA effectively has multiple start states (one per active lexer configuration).
commit(state, token): parser work lives here

commit uses real lexer + GLR machinery: it lexes the chosen token’s bytes (updating the active lexer configurations), performs the actual shifts and reductions, and maintains the GSS.
This can still have pathological worst cases in theory (GLR is GLR), but for practical grammars it tends to behave well, and—importantly—you’re only doing it once per chosen token, not for thousands of hypothetical tokens.
get_mask(state): mask work is “read-only set flow”

get_mask uses the precompiled weighted automaton: it reads stack state IDs off the current GSS and flows token sets through intersections and unions, without mutating any parse state.
So you get a clean split:
- commit is responsible for advancing the real incremental parse state,
- get_mask is responsible for cheaply answering “which vocabulary tokens could be next?”

Mask computation becomes a bounded, read-only walk over the structures you already maintain.
Critically, at runtime:
Work scales with the size of the parse configuration you’re already maintaining (stack/GSS), not with vocabulary size or token byte weirdness.
All of this shifts cost to compile time.
Precompiling into the Parser DWA involves determinization, weight pushing, and aggressive simplification in a setting where weights are large token sets and merges do unions.
If you’re not careful, big grammars can blow up in memory/time. Getting compilation to behave took most of my engineering effort.
That’s probably the next post.
If you care about predictable p99.9 latency, the main takeaway is simple:
don’t make get_mask do speculative parsing—compile that speculation away, and make masking a bounded, cache-friendly walk of set intersections/unions over the existing parse structure.